Let’s talk about data engineers’ nightmare
As data engineers, we encounter unique challenges every day. But if there is one daunting task that stands out, it must be the backfill. A flawed backfill means excessive processing time, data contamination, and substantial cloud bills. And yeah, it also means you need one more backfill job to fix it.
Completing your first successful data backfill is a data engineering rite of passage. — Dagster
A backfill demands a combination of data engineering skills to pull off well: domain knowledge to validate the results, tooling expertise to run the backfill jobs, and a solid understanding of the database to optimize the process. When all of these elements are intertwined within a single task, things can go wrong.
In this article, we will explore what data backfilling is, why it is necessary, and how to implement it efficiently. Whether you are new to backfilling or someone who still panics at the thought of it, this article will calm your mind and help you regain your confidence.
What is backfill?
Backfill is the process of filling in missing historical data for a table that didn't exist before, or replacing old data with corrected records. It is usually not a recurring job, and it is only necessary for data pipelines that update a table incrementally.
For example, consider a table partitioned on a date column. A regular daily job updates only the latest two partitions, while a backfill job can update partitions all the way back to the table's very first one. If the regular job rewrote the entire table on every run, a backfill job would be unnecessary, since the historical data would naturally be refreshed by the regular job anyway.
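To make the contrast concrete, here is a minimal sketch in Python. The table, column names, and the build_job_sql() helper are illustrative assumptions, not something from a specific pipeline; the point is only that the regular job touches a couple of recent partitions while the backfill loops over the full historical range.

```python
from datetime import date, timedelta

def build_job_sql(partition_day: date) -> str:
    # Hypothetical query template: recompute one date partition of a daily
    # aggregate table from a raw events table. Names are illustrative only.
    return f"""
    DELETE FROM analytics.events_daily WHERE event_date = DATE '{partition_day}';
    INSERT INTO analytics.events_daily
    SELECT event_date, user_id, COUNT(*) AS event_count
    FROM raw.events
    WHERE event_date = DATE '{partition_day}'
    GROUP BY event_date, user_id;
    """

def regular_daily_run(today: date, lookback_days: int = 2) -> list[str]:
    # The scheduled job only touches the latest few partitions.
    return [build_job_sql(today - timedelta(days=d)) for d in range(lookback_days)]

def backfill_run(start: date, end: date) -> list[str]:
    # A backfill walks every partition in the historical range, one day at a time.
    days = (end - start).days + 1
    return [build_job_sql(start + timedelta(days=d)) for d in range(days)]

if __name__ == "__main__":
    print(len(regular_daily_run(date(2024, 6, 1))))               # 2 partitions
    print(len(backfill_run(date(2023, 1, 1), date(2024, 6, 1))))  # every partition since day one
```

In practice you would hand each generated statement to your warehouse client or orchestrator; schedulers such as Airflow and Dagster also ship built-in backfill mechanisms that drive this kind of partition-by-partition loop for you.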
So, when do we need to backfill?
In general, there are a few common scenarios. Let’s see if you find them familiar.
Create a new table and want to fill in missing historical data